{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Neural network hybrid recommendation system on Google Analytics data model and training\n",
    "\n",
    "This notebook demonstrates how to implement a hybrid recommendation system using a neural network to combine content-based and collaborative filtering recommendation models using Google Analytics data. We are going to use the learned user embeddings from [wals.ipynb](../wals.ipynb) and combine that with our previous content-based features from [content_based_using_neural_networks.ipynb](../content_based_using_neural_networks.ipynb)\n",
    "\n",
    "Now that we have our data preprocessed from BigQuery and Cloud Dataflow, we can build our neural network hybrid recommendation model to our preprocessed data. Then we can train locally to make sure everything works and then use the power of Google Cloud ML Engine to scale it out."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're going to use TensorFlow Hub to use trained text embeddings, so let's first pip install that and reset our session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip3 install tensorflow_hub"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "pip install --upgrade tensorflow"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now reset the notebook's session kernel! Since we're no longer using Cloud Dataflow, we'll be using the python3 kernel from here on out so don't forget to change the kernel if it's still python2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import helpful libraries and setup our project, bucket, and region\n",
    "import os\n",
    "import tensorflow as tf\n",
    "import tensorflow_hub as hub\n",
    "\n",
    "PROJECT = \"cloud-training-demos\" # REPLACE WITH YOUR PROJECT ID\n",
    "BUCKET = \"cloud-training-demos-ml\" # REPLACE WITH YOUR BUCKET NAME\n",
    "REGION = \"us-central1\" # REPLACE WITH YOUR BUCKET REGION e.g. us-central1\n",
    "\n",
    "# do not change these\n",
    "os.environ[\"PROJECT\"] = PROJECT\n",
    "os.environ[\"BUCKET\"] = BUCKET\n",
    "os.environ[\"REGION\"] = REGION\n",
    "os.environ[\"TFVERSION\"] = \"1.13\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "gcloud config set project $PROJECT\n",
    "gcloud config set compute/region $REGION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "if ! gcloud storage ls | grep -q gs://${BUCKET}/hybrid_recommendation/preproc; then\n",    "    gcloud storage buckets create --location ${REGION} gs://${BUCKET}\n",    "    # copy canonical set of preprocessed files if you didn't do preprocessing notebook\n",
    "    gcloud storage cp --recursive gs://cloud-training-demos/courses/machine_learning/deepdive/10_recommendation/hybrid_recommendation gs://${BUCKET}\n",    "fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2> Create hybrid recommendation system model using TensorFlow </h2>\n",
    "\n",
    "Now that we've created our training and evaluation input files as well as our categorical feature vocabulary files, we can create our TensorFlow hybrid recommendation system model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's first get some of our aggregate information that we will use in the model from some of our preprocessed files we saved in Google Cloud Storage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tensorflow.python.lib.io import file_io"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get number of content ids from text file in Google Cloud Storage\n",
    "with file_io.FileIO(tf.gfile.Glob(filename = \"gs://{}/hybrid_recommendation/preproc/vocab_counts/content_id_vocab_count.txt*\".format(BUCKET))[0], mode = 'r') as ifp:\n",
    "    number_of_content_ids = int([x for x in ifp][0])\n",
    "print(\"number_of_content_ids = {}\".format(number_of_content_ids))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get number of categories from text file in Google Cloud Storage\n",
    "with file_io.FileIO(tf.gfile.Glob(filename = \"gs://{}/hybrid_recommendation/preproc/vocab_counts/category_vocab_count.txt*\".format(BUCKET))[0], mode = 'r') as ifp:\n",
    "    number_of_categories = int([x for x in ifp][0])\n",
    "print(\"number_of_categories = {}\".format(number_of_categories))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get number of authors from text file in Google Cloud Storage\n",
    "with file_io.FileIO(tf.gfile.Glob(filename = \"gs://{}/hybrid_recommendation/preproc/vocab_counts/author_vocab_count.txt*\".format(BUCKET))[0], mode = 'r') as ifp:\n",
    "    number_of_authors = int([x for x in ifp][0])\n",
    "print(\"number_of_authors = {}\".format(number_of_authors))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get mean months since epoch from text file in Google Cloud Storage\n",
    "with file_io.FileIO(tf.gfile.Glob(filename = \"gs://{}/hybrid_recommendation/preproc/vocab_counts/months_since_epoch_mean.txt*\".format(BUCKET))[0], mode = 'r') as ifp:\n",
    "    mean_months_since_epoch = float([x for x in ifp][0])\n",
    "print(\"mean_months_since_epoch = {}\".format(mean_months_since_epoch))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Determine CSV and label columns\n",
    "NON_FACTOR_COLUMNS = \"next_content_id,visitor_id,content_id,category,title,author,months_since_epoch\".split(',')\n",
    "FACTOR_COLUMNS = [\"user_factor_{}\".format(i) for i in range(10)] + [\"item_factor_{}\".format(i) for i in range(10)]\n",
    "CSV_COLUMNS = NON_FACTOR_COLUMNS + FACTOR_COLUMNS\n",
    "LABEL_COLUMN = \"next_content_id\"\n",
    "\n",
    "# Set default values for each CSV column\n",
    "NON_FACTOR_DEFAULTS = [[\"Unknown\"],[\"Unknown\"],[\"Unknown\"],[\"Unknown\"],[\"Unknown\"],[\"Unknown\"],[mean_months_since_epoch]]\n",
    "FACTOR_DEFAULTS = [[0.0] for i in range(10)] + [[0.0] for i in range(10)] # user and item\n",
    "DEFAULTS = NON_FACTOR_DEFAULTS + FACTOR_DEFAULTS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create input function for training and evaluation to read from our preprocessed CSV files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create input function for train and eval\n",
    "def read_dataset(filename, mode, batch_size = 512):\n",
    "    def _input_fn():\n",
    "        def decode_csv(value_column):\n",
    "            columns = tf.decode_csv(records = value_column, record_defaults = DEFAULTS)\n",
    "            features = dict(zip(CSV_COLUMNS, columns))          \n",
    "            label = features.pop(LABEL_COLUMN)         \n",
    "            return features, label\n",
    "\n",
    "        # Create list of files that match pattern\n",
    "        file_list = tf.gfile.Glob(filename = filename)\n",
    "\n",
    "        # Create dataset from file list\n",
    "        dataset = tf.data.TextLineDataset(filenames = file_list).map(map_func = decode_csv)\n",
    "\n",
    "        if mode == tf.estimator.ModeKeys.TRAIN:\n",
    "            num_epochs = None # indefinitely\n",
    "            dataset = dataset.shuffle(buffer_size = 10 * batch_size)\n",
    "        else:\n",
    "            num_epochs = 1 # end-of-input after this\n",
    "\n",
    "        dataset = dataset.repeat(count = num_epochs).batch(batch_size = batch_size)\n",
    "        return dataset.make_one_shot_iterator().get_next()\n",
    "    return _input_fn"
   ]
  },
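  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the parsing step concrete, here is a minimal sketch in plain Python (no TensorFlow) of what `decode_csv` does to a single line: zip the column names with the parsed values and pop the label. The sample row is made up for illustration and is not taken from the preprocessed files, and unlike `tf.decode_csv` this sketch neither converts types nor substitutes defaults for empty fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only: mimic decode_csv on one hypothetical CSV line\n",
    "import csv\n",
    "\n",
    "sketch_columns = (\n",
    "    \"next_content_id,visitor_id,content_id,category,title,author,months_since_epoch\".split(',')\n",
    "    + [\"user_factor_{}\".format(i) for i in range(10)]\n",
    "    + [\"item_factor_{}\".format(i) for i in range(10)])\n",
    "\n",
    "# Hypothetical row: six string fields, one float, then 20 factor values\n",
    "sample_line = \"c2,v1,c1,News,Some title,Some author,574.0,\" + \",\".join([\"0.1\"] * 20)\n",
    "\n",
    "values = next(csv.reader([sample_line]))\n",
    "sketch_features = dict(zip(sketch_columns, values))\n",
    "sketch_label = sketch_features.pop(\"next_content_id\")\n",
    "\n",
    "print(\"label = {}\".format(sketch_label))\n",
    "print(\"number of features = {}\".format(len(sketch_features)))"
   ]
  },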
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we will create our feature columns using our read in features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create feature columns to be used in model\n",
    "def create_feature_columns(args):\n",
    "    # Create content_id feature column\n",
    "    content_id_column = tf.feature_column.categorical_column_with_hash_bucket(\n",
    "        key = \"content_id\",\n",
    "        hash_bucket_size = number_of_content_ids)\n",
    "\n",
    "    # Embed content id into a lower dimensional representation\n",
    "    embedded_content_column = tf.feature_column.embedding_column(\n",
    "        categorical_column = content_id_column,\n",
    "        dimension = args[\"content_id_embedding_dimensions\"])\n",
    "\n",
    "    # Create category feature column\n",
    "    categorical_category_column = tf.feature_column.categorical_column_with_vocabulary_file(\n",
    "        key = \"category\",\n",
    "        vocabulary_file = tf.gfile.Glob(filename = \"gs://{}/hybrid_recommendation/preproc/vocabs/category_vocab.txt*\".format(args[\"bucket\"]))[0],\n",
    "        num_oov_buckets = 1)\n",
    "\n",
    "    # Convert categorical category column into indicator column so that it can be used in a DNN\n",
    "    indicator_category_column = tf.feature_column.indicator_column(categorical_column = categorical_category_column)\n",
    "\n",
    "    # Create title feature column using TF Hub\n",
    "    embedded_title_column = hub.text_embedding_column(\n",
    "        key = \"title\", \n",
    "        module_spec = \"https://tfhub.dev/google/nnlm-de-dim50-with-normalization/1\",\n",
    "        trainable = False)\n",
    "\n",
    "    # Create author feature column\n",
    "    author_column = tf.feature_column.categorical_column_with_hash_bucket(\n",
    "        key = \"author\",\n",
    "        hash_bucket_size = number_of_authors + 1)\n",
    "\n",
    "    # Embed author into a lower dimensional representation\n",
    "    embedded_author_column = tf.feature_column.embedding_column(\n",
    "        categorical_column = author_column,\n",
    "        dimension = args[\"author_embedding_dimensions\"])\n",
    "\n",
    "    # Create months since epoch boundaries list for our binning\n",
    "    months_since_epoch_boundaries = list(range(400, 700, 20))\n",
    "\n",
    "    # Create months_since_epoch feature column using raw data\n",
    "    months_since_epoch_column = tf.feature_column.numeric_column(\n",
    "        key = \"months_since_epoch\")\n",
    "\n",
    "    # Create bucketized months_since_epoch feature column using our boundaries\n",
    "    months_since_epoch_bucketized = tf.feature_column.bucketized_column(\n",
    "        source_column = months_since_epoch_column,\n",
    "        boundaries = months_since_epoch_boundaries)\n",
    "\n",
    "    # Cross our categorical category column and bucketized months since epoch column\n",
    "    crossed_months_since_category_column = tf.feature_column.crossed_column(\n",
    "        keys = [categorical_category_column, months_since_epoch_bucketized],\n",
    "        hash_bucket_size = len(months_since_epoch_boundaries) * (number_of_categories + 1))\n",
    "\n",
    "    # Convert crossed categorical category and bucketized months since epoch column into indicator column so that it can be used in a DNN\n",
    "    indicator_crossed_months_since_category_column = tf.feature_column.indicator_column(\n",
    "            categorical_column = crossed_months_since_category_column)\n",
    "\n",
    "    # Create user and item factor feature columns from our trained WALS model\n",
    "    user_factors = [tf.feature_column.numeric_column(key = \"user_factor_\" + str(i)) for i in range(10)]\n",
    "    item_factors =  [tf.feature_column.numeric_column(key = \"item_factor_\" + str(i)) for i in range(10)]\n",
    "\n",
    "    # Create list of feature columns\n",
    "    feature_columns = [embedded_content_column,\n",
    "    embedded_author_column,\n",
    "    indicator_category_column,\n",
    "    embedded_title_column,\n",
    "    indicator_crossed_months_since_category_column] + user_factors + item_factors\n",
    "\n",
    "    return feature_columns"
   ]
  },
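  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The bucketized column above assigns each `months_since_epoch` value to the interval between two consecutive boundaries. A rough plain-Python equivalent uses `bisect_right`, assuming buckets are closed on the left, so a value equal to a boundary falls into the bucket that starts there. The sample values are arbitrary and only serve to show the index each one would get."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only: which bucket index does a months_since_epoch value fall into?\n",
    "from bisect import bisect_right\n",
    "\n",
    "sketch_boundaries = list(range(400, 700, 20)) # same boundaries as above: 400, 420, ..., 680\n",
    "\n",
    "for months in [350.0, 400.0, 574.0, 700.0]: # arbitrary sample values\n",
    "    bucket_index = bisect_right(sketch_boundaries, months)\n",
    "    print(\"months_since_epoch = {} -> bucket {} of {}\".format(months, bucket_index, len(sketch_boundaries) + 1))"
   ]
  },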
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll create our model function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create custom model function for our custom estimator\n",
    "def model_fn(features, labels, mode, params):\n",
    "    # TODO: Create neural network input layer using our feature columns defined above\n",
    "\n",
    "    # TODO: Create hidden layers by looping through hidden unit list\n",
    "\n",
    "    # TODO: Compute logits (1 per class) using the output of our last hidden layer\n",
    "\n",
    "    # TODO: Find the predicted class indices based on the highest logit (which will result in the highest probability)\n",
    "    predicted_classes = \n",
    "\n",
    "  # Read in the content id vocabulary so we can tie the predicted class indices to their respective content ids\n",
    "    with file_io.FileIO(tf.gfile.Glob(filename = \"gs://{}/hybrid_recommendation/preproc/vocabs/content_id_vocab.txt*\".format(BUCKET))[0], mode = \"r\") as ifp:\n",
    "        content_id_names = tf.constant(value = [x.rstrip() for x in ifp])\n",
    "\n",
    "    # Gather predicted class names based predicted class indices\n",
    "    predicted_class_names = tf.gather(params = content_id_names, indices = predicted_classes)\n",
    "\n",
    "    # If the mode is prediction\n",
    "    if mode == tf.estimator.ModeKeys.PREDICT:\n",
    "        # Create predictions dict\n",
    "        predictions_dict = {\n",
    "            \"class_ids\": tf.expand_dims(input = predicted_classes, axis = -1),\n",
    "            \"class_names\" : tf.expand_dims(input = predicted_class_names, axis = -1),\n",
    "            \"probabilities\": tf.nn.softmax(logits = logits),\n",
    "            \"logits\": logits\n",
    "        }\n",
    "\n",
    "        # Create export outputs\n",
    "        export_outputs = {\"predict_export_outputs\": tf.estimator.export.PredictOutput(outputs = predictions_dict)}\n",
    "\n",
    "        return tf.estimator.EstimatorSpec( # return early since we\"re done with what we need for prediction mode\n",
    "          mode = mode,\n",
    "          predictions = predictions_dict,\n",
    "          loss = None,\n",
    "          train_op = None,\n",
    "          eval_metric_ops = None,\n",
    "          export_outputs = export_outputs)\n",
    "\n",
    "    # Continue on with training and evaluation modes\n",
    "\n",
    "    # Create lookup table using our content id vocabulary\n",
    "    table = tf.contrib.lookup.index_table_from_file(\n",
    "        vocabulary_file = tf.gfile.Glob(filename = \"gs://{}/hybrid_recommendation/preproc/vocabs/content_id_vocab.txt*\".format(BUCKET))[0])\n",
    "\n",
    "    # Look up labels from vocabulary table\n",
    "    labels = table.lookup(keys = labels)\n",
    "\n",
    "    # TODO: Compute loss using the correct type of softmax cross entropy since this is classification and our labels (content id indices) and probabilities are mutually exclusive\n",
    "    loss = \n",
    "\n",
    "    # If the mode is evaluation\n",
    "    if mode == tf.estimator.ModeKeys.EVAL:\n",
    "        # Compute evaluation metrics of total accuracy and the accuracy of the top k classes\n",
    "        accuracy = tf.metrics.accuracy(labels = labels, predictions = predicted_classes, name = \"acc_op\")\n",
    "        top_k_accuracy = tf.metrics.mean(values = tf.nn.in_top_k(predictions = logits, targets = labels, k = params[\"top_k\"]))\n",
    "        map_at_k = tf.metrics.average_precision_at_k(labels = labels, predictions = predicted_classes, k = params[\"top_k\"])\n",
    "\n",
    "        # Put eval metrics into a dictionary\n",
    "        eval_metric_ops = {\n",
    "            \"accuracy\": accuracy,\n",
    "            \"top_k_accuracy\": top_k_accuracy,\n",
    "            \"map_at_k\": map_at_k}\n",
    "\n",
    "        # Create scalar summaries to see in TensorBoard\n",
    "        tf.summary.scalar(name = \"accuracy\", tensor = accuracy[1])\n",
    "        tf.summary.scalar(name = \"top_k_accuracy\", tensor = top_k_accuracy[1])\n",
    "        tf.summary.scalar(name = \"map_at_k\", tensor = map_at_k[1])\n",
    "    \n",
    "        return tf.estimator.EstimatorSpec( # return early since we\"re done with what we need for evaluation mode\n",
    "            mode = mode,\n",
    "            predictions = None,\n",
    "            loss = loss,\n",
    "            train_op = None,\n",
    "            eval_metric_ops = eval_metric_ops,\n",
    "            export_outputs = None)\n",
    "\n",
    "    # Continue on with training mode\n",
    "\n",
    "    # If the mode is training\n",
    "    assert mode == tf.estimator.ModeKeys.TRAIN\n",
    "\n",
    "    # Create a custom optimizer\n",
    "    optimizer = tf.train.AdagradOptimizer(learning_rate = params[\"learning_rate\"])\n",
    "\n",
    "    # Create train op\n",
    "    train_op = optimizer.minimize(loss = loss, global_step = tf.train.get_global_step())\n",
    "\n",
    "    return tf.estimator.EstimatorSpec( # final return since we\"re done with what we need for training mode\n",
    "        mode = mode,\n",
    "        predictions = None,\n",
    "        loss = loss,\n",
    "        train_op = train_op,\n",
    "        eval_metric_ops = None,\n",
    "        export_outputs = None)"
   ]
  },
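  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sanity check on the loss and the top-k metric that the model function refers to, here is a small plain-Python sketch for a single example with made-up logits. It assumes the standard definitions: sparse softmax cross entropy is the negative log of the softmax probability assigned to the true class index, and an example counts as a top-k hit when its true class is among the k largest logits. It is not the TensorFlow implementation, only the arithmetic behind it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only: loss and top-k hit for one made-up example\n",
    "import math\n",
    "\n",
    "toy_logits = [2.0, 0.5, -1.0, 0.1] # made-up scores for 4 classes\n",
    "toy_label = 1 # made-up index of the true class\n",
    "\n",
    "# Numerically stable softmax: subtract the max logit before exponentiating\n",
    "max_logit = max(toy_logits)\n",
    "exps = [math.exp(x - max_logit) for x in toy_logits]\n",
    "toy_probabilities = [e / sum(exps) for e in exps]\n",
    "\n",
    "# Sparse softmax cross entropy for one example\n",
    "toy_loss = -math.log(toy_probabilities[toy_label])\n",
    "\n",
    "# Top-k hit: is the true class among the k highest logits?\n",
    "toy_k = 2\n",
    "top_k_indices = sorted(range(len(toy_logits)), key = lambda i: toy_logits[i], reverse = True)[:toy_k]\n",
    "top_k_hit = toy_label in top_k_indices\n",
    "\n",
    "print(\"predicted class = {}\".format(toy_probabilities.index(max(toy_probabilities))))\n",
    "print(\"loss = {:.4f}\".format(toy_loss))\n",
    "print(\"top-{} hit = {}\".format(toy_k, top_k_hit))"
   ]
  },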
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now create a serving input function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create serving input function\n",
    "def serving_input_fn():  \n",
    "    feature_placeholders = {\n",
    "        colname : tf.placeholder(dtype = tf.string, shape = [None]) \\\n",
    "        for colname in NON_FACTOR_COLUMNS[1:-1]\n",
    "    }\n",
    "    feature_placeholders[\"months_since_epoch\"] = tf.placeholder(dtype = tf.float32, shape = [None])\n",
    "\n",
    "    for colname in FACTOR_COLUMNS:\n",
    "        feature_placeholders[colname] = tf.placeholder(dtype = tf.float32, shape = [None])\n",
    "\n",
    "    features = {\n",
    "        key: tf.expand_dims(tensor, -1) \\\n",
    "        for key, tensor in feature_placeholders.items()\n",
    "    }\n",
    "\n",
    "    return tf.estimator.export.ServingInputReceiver(features = features, receiver_tensors = feature_placeholders)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that all of the pieces are assembled let's create and run our train and evaluate loop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create train and evaluate loop to combine all of the pieces together.\n",
    "tf.logging.set_verbosity(tf.logging.INFO)\n",
    "def train_and_evaluate(args):\n",
    "    estimator = tf.estimator.Estimator(\n",
    "        model_fn = model_fn,\n",
    "        model_dir = args[\"output_dir\"],\n",
    "        params = {\n",
    "        \"feature_columns\": create_feature_columns(args),\n",
    "        \"hidden_units\": args[\"hidden_units\"],\n",
    "        \"n_classes\": number_of_content_ids,\n",
    "        \"learning_rate\": args[\"learning_rate\"],\n",
    "        \"top_k\": args[\"top_k\"],\n",
    "        \"bucket\": args[\"bucket\"]\n",
    "        }\n",
    "    )\n",
    "\n",
    "    train_spec = tf.estimator.TrainSpec(\n",
    "        input_fn = read_dataset(filename = args[\"train_data_paths\"], mode = tf.estimator.ModeKeys.TRAIN, batch_size = args[\"batch_size\"]),\n",
    "        max_steps = args[\"train_steps\"])\n",
    "\n",
    "    exporter = tf.estimator.LatestExporter(name = \"exporter\", serving_input_receiver_fn = serving_input_fn)\n",
    "\n",
    "    eval_spec = tf.estimator.EvalSpec(\n",
    "        input_fn = read_dataset(filename = args[\"eval_data_paths\"], mode = tf.estimator.ModeKeys.EVAL, batch_size = args[\"batch_size\"]),\n",
    "        steps = None,\n",
    "        start_delay_secs = args[\"start_delay_secs\"],\n",
    "        throttle_secs = args[\"throttle_secs\"],\n",
    "        exporters = exporter)\n",
    "\n",
    "    tf.estimator.train_and_evaluate(estimator = estimator, train_spec = train_spec, eval_spec = eval_spec)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run train_and_evaluate!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Call train and evaluate loop\n",
    "import shutil\n",
    "\n",
    "outdir = \"hybrid_recommendation_trained\"\n",
    "shutil.rmtree(path = outdir, ignore_errors = True) # start fresh each time\n",
    "\n",
    "arguments = {\n",
    "    \"bucket\": BUCKET,\n",
    "    \"train_data_paths\": \"gs://{}/hybrid_recommendation/preproc/features/train.csv*\".format(BUCKET),\n",
    "    \"eval_data_paths\": \"gs://{}/hybrid_recommendation/preproc/features/eval.csv*\".format(BUCKET),\n",
    "    \"output_dir\": outdir,\n",
    "    \"batch_size\": 128,\n",
    "    \"learning_rate\": 0.1,\n",
    "    \"hidden_units\": [256, 128, 64],\n",
    "    \"content_id_embedding_dimensions\": 10,\n",
    "    \"author_embedding_dimensions\": 10,\n",
    "    \"top_k\": 10,\n",
    "    \"train_steps\": 1000,\n",
    "    \"start_delay_secs\": 30,\n",
    "    \"throttle_secs\": 30\n",
    "}\n",
    "\n",
    "train_and_evaluate(arguments)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run on module locally\n",
    "\n",
    "Now let's place our code into a python module with model.py and task.py files so that we can train using Google Cloud's ML Engine! First, let's test our module locally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile requirements.txt\n",
    "tensorflow_hub"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "echo \"bucket=${BUCKET}\"\n",
    "rm -rf hybrid_recommendation_trained\n",
    "export PYTHONPATH=${PYTHONPATH}:${PWD}/hybrid_recommendations_module\n",
    "python -m trainer.task \\\n",
    "    --bucket=${BUCKET} \\\n",
    "    --train_data_paths=gs://${BUCKET}/hybrid_recommendation/preproc/features/train.csv* \\\n",
    "    --eval_data_paths=gs://${BUCKET}/hybrid_recommendation/preproc/features/eval.csv* \\\n",
    "    --output_dir=${OUTDIR} \\\n",
    "    --batch_size=128 \\\n",
    "    --learning_rate=0.1 \\\n",
    "    --hidden_units=\"256 128 64\" \\\n",
    "    --content_id_embedding_dimensions=10 \\\n",
    "    --author_embedding_dimensions=10 \\\n",
    "    --top_k=10 \\\n",
    "    --train_steps=1000 \\\n",
    "    --start_delay_secs=30 \\\n",
    "    --throttle_secs=60"
   ]
  },
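  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `hidden_units` is passed on the command line as a single space-separated string, while the model function expects a list of integers. The `trainer/task.py` file is not shown in this notebook, so the following is only a sketch of how such an argument could be parsed with `argparse`; the flag name matches the command above, but the parsing code itself is an assumption and not the module's actual implementation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only: turn a space-separated --hidden_units string into a list of ints\n",
    "import argparse\n",
    "\n",
    "sketch_parser = argparse.ArgumentParser()\n",
    "sketch_parser.add_argument(\"--hidden_units\", type = str, default = \"256 128 64\")\n",
    "\n",
    "# Parse an explicit argument list instead of sys.argv so this runs inside a notebook\n",
    "sketch_args = sketch_parser.parse_args([\"--hidden_units=512 128 64\"])\n",
    "sketch_hidden_units = [int(units) for units in sketch_args.hidden_units.split()]\n",
    "print(sketch_hidden_units)"
   ]
  },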
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Run on Google Cloud AI Platform\n",
    "If our module locally trained fine, let's now use of the power of AI Platform to scale it out on Google Cloud."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "OUTDIR=gs://${BUCKET}/hybrid_recommendation/small_trained_model\n",
    "JOBNAME=hybrid_recommendation_$(date -u +%y%m%d_%H%M%S)\n",
    "echo $OUTDIR $REGION $JOBNAME\n",
    "gcloud storage rm --recursive --continue-on-error $OUTDIR\n",    "gcloud ml-engine jobs submit training $JOBNAME \\\n",
    "    --region=$REGION \\\n",
    "    --module-name=trainer.task \\\n",
    "    --package-path=$(pwd)/hybrid_recommendations_module/trainer \\\n",
    "    --job-dir=$OUTDIR \\\n",
    "    --staging-bucket=gs://$BUCKET \\\n",
    "    --scale-tier=STANDARD_1 \\\n",
    "    --runtime-version=$TFVERSION \\\n",
    "    -- \\\n",
    "    --bucket=${BUCKET} \\\n",
    "    --train_data_paths=gs://${BUCKET}/hybrid_recommendation/preproc/features/train.csv* \\\n",
    "    --eval_data_paths=gs://${BUCKET}/hybrid_recommendation/preproc/features/eval.csv* \\\n",
    "    --output_dir=${OUTDIR} \\\n",
    "    --batch_size=128 \\\n",
    "    --learning_rate=0.1 \\\n",
    "    --hidden_units=\"256 128 64\" \\\n",
    "    --content_id_embedding_dimensions=10 \\\n",
    "    --author_embedding_dimensions=10 \\\n",
    "    --top_k=10 \\\n",
    "    --train_steps=1000 \\\n",
    "    --start_delay_secs=30 \\\n",
    "    --throttle_secs=30"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's add some hyperparameter tuning!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile hyperparam.yaml\n",
    "trainingInput:\n",
    "    hyperparameters:\n",
    "        goal: MAXIMIZE\n",
    "        maxTrials: 5\n",
    "        maxParallelTrials: 1\n",
    "        hyperparameterMetricTag: accuracy\n",
    "        params:\n",
    "            - parameterName: batch_size\n",
    "              type: INTEGER\n",
    "              minValue: 8\n",
    "              maxValue: 64\n",
    "              scaleType: UNIT_LINEAR_SCALE\n",
    "            - parameterName: learning_rate\n",
    "              type: DOUBLE\n",
    "              minValue: 0.01\n",
    "              maxValue: 0.1\n",
    "              scaleType: UNIT_LINEAR_SCALE\n",
    "            - parameterName: hidden_units\n",
    "              type: CATEGORICAL\n",
    "              categoricalValues: [\"1024 512 256\", \"1024 512 128\", \"1024 256 128\", \"512 256 128\", \"1024 512 64\", \"1024 256 64\", \"512 256 64\", \"1024 128 64\", \"512 128 64\", \"256 128 64\", \"1024 512 32\", \"1024 256 32\", \"512 256 32\", \"1024 128 32\", \"512 128 32\", \"256 128 32\", \"1024 64 32\", \"512 64 32\", \"256 64 32\", \"128 64 32\"]\n",
    "            - parameterName: content_id_embedding_dimensions\n",
    "              type: INTEGER\n",
    "              minValue: 5\n",
    "              maxValue: 250\n",
    "              scaleType: UNIT_LOG_SCALE\n",
    "            - parameterName: author_embedding_dimensions\n",
    "              type: INTEGER\n",
    "              minValue: 5\n",
    "              maxValue: 30\n",
    "              scaleType: UNIT_LINEAR_SCALE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "OUTDIR=gs://${BUCKET}/hybrid_recommendation/hypertuning\n",
    "JOBNAME=hybrid_recommendation_$(date -u +%y%m%d_%H%M%S)\n",
    "echo $OUTDIR $REGION $JOBNAME\n",
    "gcloud storage rm --recursive --continue-on-error $OUTDIR\n",    "gcloud ml-engine jobs submit training $JOBNAME \\\n",
    "    --region=$REGION \\\n",
    "    --module-name=trainer.task \\\n",
    "    --package-path=$(pwd)/hybrid_recommendations_module/trainer \\\n",
    "    --job-dir=$OUTDIR \\\n",
    "    --staging-bucket=gs://$BUCKET \\\n",
    "    --scale-tier=STANDARD_1 \\\n",
    "    --runtime-version=$TFVERSION \\\n",
    "    --config=hyperparam.yaml \\\n",
    "    -- \\\n",
    "    --bucket=${BUCKET} \\\n",
    "    --train_data_paths=gs://${BUCKET}/hybrid_recommendation/preproc/features/train.csv* \\\n",
    "    --eval_data_paths=gs://${BUCKET}/hybrid_recommendation/preproc/features/eval.csv* \\\n",
    "    --output_dir=${OUTDIR} \\\n",
    "    --batch_size=128 \\\n",
    "    --learning_rate=0.1 \\\n",
    "    --hidden_units=\"256 128 64\" \\\n",
    "    --content_id_embedding_dimensions=10 \\\n",
    "    --author_embedding_dimensions=10 \\\n",
    "    --top_k=10 \\\n",
    "    --train_steps=1000 \\\n",
    "    --start_delay_secs=30 \\\n",
    "    --throttle_secs=30"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we know the best hyperparameters, run a big training job!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "OUTDIR=gs://${BUCKET}/hybrid_recommendation/big_trained_model\n",
    "JOBNAME=hybrid_recommendation_$(date -u +%y%m%d_%H%M%S)\n",
    "echo $OUTDIR $REGION $JOBNAME\n",
    "gcloud storage rm --recursive --continue-on-error $OUTDIR\n",    "gcloud ml-engine jobs submit training $JOBNAME \\\n",
    "    --region=$REGION \\\n",
    "    --module-name=trainer.task \\\n",
    "    --package-path=$(pwd)/hybrid_recommendations_module/trainer \\\n",
    "    --job-dir=$OUTDIR \\\n",
    "    --staging-bucket=gs://$BUCKET \\\n",
    "    --scale-tier=STANDARD_1 \\\n",
    "    --runtime-version=$TFVERSION \\\n",
    "    -- \\\n",
    "    --bucket=${BUCKET} \\\n",
    "    --train_data_paths=gs://${BUCKET}/hybrid_recommendation/preproc/features/train.csv* \\\n",
    "    --eval_data_paths=gs://${BUCKET}/hybrid_recommendation/preproc/features/eval.csv* \\\n",
    "    --output_dir=${OUTDIR} \\\n",
    "    --batch_size=128 \\\n",
    "    --learning_rate=0.1 \\\n",
    "    --hidden_units=\"256 128 64\" \\\n",
    "    --content_id_embedding_dimensions=10 \\\n",
    "    --author_embedding_dimensions=10 \\\n",
    "    --top_k=10 \\\n",
    "    --train_steps=10000 \\\n",
    "    --start_delay_secs=30 \\\n",
    "    --throttle_secs=30"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
